This dataset is populated by capturing user ratings from Google reviews. Reviews of attractions from 24 categories across Europe are considered. Google user ratings range from 1 to 5, and the average rating per category is calculated for each user.
Input variables:
1) User: Unique user id
2) Attribute 1: Average ratings on churches
3) Attribute 2: Average ratings on resorts
4) Attribute 3: Average ratings on beaches
5) Attribute 4: Average ratings on parks
6) Attribute 5: Average ratings on theatres
7) Attribute 6: Average ratings on museums
8) Attribute 7: Average ratings on malls
9) Attribute 8: Average ratings on zoo
10) Attribute 9: Average ratings on restaurants
11) Attribute 10: Average ratings on pubs/bars
12) Attribute 11: Average ratings on local services
13) Attribute 12: Average ratings on burger/pizza shops
14) Attribute 13: Average ratings on hotels/other lodgings
15) Attribute 14: Average ratings on juice bars
16) Attribute 15: Average ratings on art galleries
17) Attribute 16: Average ratings on dance clubs
18) Attribute 17: Average ratings on swimming pools
19) Attribute 18: Average ratings on gyms
20) Attribute 19: Average ratings on bakeries
21) Attribute 20: Average ratings on beauty & spas
22) Attribute 21: Average ratings on cafes
23) Attribute 22: Average ratings on view points
24) Attribute 23: Average ratings on monuments
25) Attribute 24: Average ratings on gardens
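As a minimal sketch of the schema described above (made-up values, and only two of the 24 category columns), the data looks like:

```python
import pandas as pd

# Minimal sketch of the dataset's schema with made-up values: one row per
# user, one column per category; each cell is that user's average rating
sample = pd.DataFrame({
    'User': ['User 1', 'User 2', 'User 3'],
    'churches': [1.5, 2.0, 0.0],   # a 0 can be read as "never rated"
    'beaches': [5.0, 4.0, 3.0],
})

# Overall average rating per category across all users
print(sample[['churches', 'beaches']].mean())
```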
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set default setting of seaborn
sns.set()
|
Read the data using read_csv() function from pandas
|
# read the data
raw_data = pd.read_csv('c:/users/supriya/OneDrive/Desktop/PGA 3.0/New folder/2 Unsupervised Learning-20230130T070554Z-001/2 Unsupervised Learning/3 Heirarchical Clustering/2 Project/Dataset/google_review_ratings.csv')
# print the first five rows of the data
raw_data.head()
# check the data types for variables
raw_data.info()
|
The features are named 'Category 1', 'Category 2', and so on, so we need to rename each category with its respective label. We will also drop the 'Unnamed: 25' feature as it is redundant
|
Before manipulating the data, let's check the dimensions of the dataset
# get the shape
print(raw_data.shape)
We see that the dataframe has 5456 instances and 26 features
Dropping the redundant feature
# Using drop() function to remove redundant feature
data = raw_data.drop(['Unnamed: 25'], axis=1)
data.head()
Renaming features
To perform this action, we create a list containing the actual feature names
column_names = ['user_id', 'churches', 'resorts', 'beaches', 'parks',
'theatres', 'museums', 'malls', 'zoo', 'restaurants',
'pubs_bars', 'local_services', 'burger_pizza_shops',
'hotels_other_lodgings', 'juice_bars', 'art_galleries',
'dance_clubs', 'swimming_pools', 'gyms', 'bakeries',
'beauty_spas', 'cafes', 'view_points', 'monuments', 'gardens']
Applying the above column names to the dataframe columns
data.columns = column_names
Checking whether the changes were applied
data.head()
Columns are renamed
data.info()
|
Although local_services holds numeric data, it is stored as a categorical (object) type
|
Changing the data type of local_services
data['local_services'] = pd.to_numeric(data['local_services'],errors = 'coerce')
|
Converting this column directly raises an error indicating that the feature contains an invalid string. To change the data type, we first need to resolve this error
|
data['local_services'].unique() # Getting all unique values
|
There is an entry '2\t2.' which cannot be converted to float, so we need to remove it
|
Drop rows where local_services is equal to '2\t2.'
# Get row number where local services is invalid
data[data['local_services'] == '2\t2.']['local_services']
Dropping row 2712
data = data.drop(data[data['local_services'] == '2\t2.'].index)
Now we can change the data type of local_services
data[['local_services']] = data[['local_services']].apply(pd.to_numeric)
data.info()
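As an aside, pd.to_numeric with errors='coerce' can handle such invalid entries in one step by turning them into NaN, which can then be treated as missing values. A sketch with a made-up series:

```python
import pandas as pd

# Hypothetical column containing one malformed entry like the '2\t2.' above.
# errors='coerce' turns unparseable strings into NaN instead of raising.
raw = pd.Series(['1.5', '2.0', '2\t2.', '3.2'])
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)
```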
All the features are numeric except the user_id as it is categorical in nature
data_manipulated = data.copy(deep=True)  # Creating a copy of the dataframe
# data frame with numerical features
data_manipulated.describe()
|
The above output illustrates the summary statistics of all the numeric variables like the mean, median(50%), minimum, and maximum values, along with the standard deviation.
|
# data frame with categorical features
data.describe(include='object')
|
user_id is unique for all instances
|
If the missing values are not handled properly we may end up drawing an inaccurate inference about the data. Due to improper handling, the result obtained will differ from the ones where the missing values are present.
|
In order to get the count of missing values in each column, we use the in-built function .isnull().sum()
|
# get the count of missing values
missing_values = data_manipulated.isnull().sum()
# print the count of missing values
print(missing_values)
|
There is only one missing value, so we replace it with the column mean
|
# numeric_only=True skips the non-numeric user_id column
data_no_missing = data_manipulated.fillna(data_manipulated.mean(numeric_only=True))
# Rechecking missing values
missing_values = data_no_missing.isnull().sum()
# print the count of missing values
print(missing_values)
There are no missing values present in the data.
Distributions (histograms) of the features
fig = data_no_missing.hist(figsize = (18,18))
Let's check whether any features have been rated by all users
data_description = data_no_missing.describe()
rated = data_description.loc['min'] > 0
rated[rated]
|
The above 11 features have been rated by every user, since their minimum rating is greater than 0
|
Visualizing the number of reviews for each category
# Creating a series containing the number of reviews for each feature
reviews = data_no_missing[column_names[1:]].astype(bool).sum(axis=0).sort_values()
reviews
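The astype(bool).sum(axis=0) idiom works because a rating of 0 means the category was never reviewed: casting to bool turns non-zero ratings into True, and summing counts them per column. A tiny self-contained example with made-up values:

```python
import pandas as pd

# 0.0 means "never reviewed"; bool() of 0.0 is False, of anything else True
df = pd.DataFrame({'cafes': [0.0, 3.5, 4.0], 'gyms': [0.0, 0.0, 2.0]})

# Count of non-zero (i.e. actually rated) entries per category
counts = df.astype(bool).sum(axis=0)
print(counts)
```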
column_names = data_no_missing.columns.values
plt.figure(figsize=(10,7))
plt.barh(np.arange(len(column_names[1:])), reviews.values, align='center', alpha=0.5)
plt.yticks(np.arange(len(column_names[1:])), reviews.index)
plt.xlabel('No of reviews')
plt.ylabel('Categories')
plt.title('No of reviews under each category')
Now let's check how many categories each user has reviewed
# Counting, for each user, the number of categories reviewed
no_of_reviews = data_no_missing[column_names[1:]].astype(bool).sum(axis=1).value_counts()
no_of_reviews
# Plotting the number of users vs the number of categories reviewed
plt.figure(figsize=(10,7))
plt.bar(np.arange(len(no_of_reviews)), no_of_reviews.values, align='center', alpha=0.5)
plt.xticks(np.arange(len(no_of_reviews)), no_of_reviews.index)
plt.ylabel('No of users')
plt.xlabel('No of categories reviewed')
plt.title('No of categories reviewed vs No of users')
Conclusion
Around 3500 users have rated all 24 categories, and the fewest categories rated by any user is 15. For users with a smaller number of ratings, a recommender system could be built
Now let's check the average rating per category
# Creating a dataframe to store average rating for each feature
avg_rating = data_no_missing[column_names[1:]].mean() # mean() gives the average rating per category
avg_rating = avg_rating.sort_values() # sorting the rating in increasing order
avg_rating
# Plotting average rating plots
plt.figure(figsize=(10,7))
plt.barh(np.arange(len(column_names[1:])), avg_rating.values, align='center', alpha=0.5)
plt.yticks(np.arange(len(column_names[1:])), avg_rating.index)
plt.xlabel('Average Rating')
plt.title('Average rating per Category')
Creating an id column
data_1 = data_no_missing.copy()
new = data_1['user_id'].str.split(' ',n=2,expand=True)
data_1['user'] = new[0]
data_1['id'] = pd.to_numeric(new[1]) # convert the id string to a number so it can be used in distance computations
data_1 = data_1.drop(['user_id','user'],axis=1)
data_1.head()
data_final = data_1.copy(deep = True)
data_final.head()
Import relevant packages
import scipy.cluster.hierarchy as sch
from sklearn.preprocessing import scale as s
from scipy.cluster.hierarchy import dendrogram, linkage
Z = sch.linkage(data_final,method='ward')
Z
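Each row of the linkage matrix Z records one merge as [index of cluster i, index of cluster j, merge distance, size of the new cluster]. A tiny synthetic example (made-up points standing in for data_final):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four 1-D points forming two obvious pairs
pts = np.array([[0.0], [0.1], [5.0], [5.1]])
Z_small = linkage(pts, method='ward')

# n - 1 = 3 merges for n = 4 points; the last row joins everything (size 4)
print(Z_small)
```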
# Creating and plotting a dendrogram
den = sch.dendrogram(Z)
plt.tick_params(
    axis='x',
    which='both',
    bottom=False,
    top=False,
    labelbottom=False)
plt.title('Hierarchical Clustering')
|
The number of clusters will be the number of vertical lines which are being intersected by the line drawn using the threshold. So we need to determine the cutting line.
|
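Once a threshold is chosen, scipy's fcluster can cut the dendrogram at that height and return flat cluster labels directly. A sketch on synthetic two-blob data (the threshold 2.0 is specific to this toy data, not the dataset's actual cutting line):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated blobs of five points each, standing in for data_final
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
Z_demo = linkage(pts, method='ward')

# Cutting the dendrogram at height 2.0 yields flat cluster labels
labels = fcluster(Z_demo, t=2.0, criterion='distance')
print(labels)
```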
Creating a function to determine the cutting line
def fd(*args, **kwargs):
    max_d = kwargs.pop('max_d', None)
    if max_d and 'color_threshold' not in kwargs:
        kwargs['color_threshold'] = max_d
    annotate_above = kwargs.pop('annotate_above', 0)
    ddata = dendrogram(*args, **kwargs)
    if not kwargs.get('no_plot', False):
        plt.title('Hierarchical Clustering Dendrogram (truncated)')
        plt.xlabel('sample index or (cluster size)')
        plt.ylabel('distance')
        for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
            x = 0.5 * sum(i[1:3])
            y = d[1]
            if y > annotate_above:
                plt.plot(x, y, 'o', c=c)
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                             textcoords='offset points',
                             va='top', ha='center')
        if max_d:
            plt.axhline(y=max_d, c='k')
    return ddata
Creating a dendrogram with the cutting line
Observing the heights of the dendrogram merges, we choose 80000 as the height at which the cutting line is drawn and 30000 as the threshold above which merge nodes are annotated
fd(Z,leaf_rotation=90.,show_contracted=True,annotate_above=30000,max_d=80000)
plt.tick_params(
    axis='x',
    which='both',
    bottom=False,
    top=False,
    labelbottom=False)
|
We can see that there are basically 2 clusters possible
|
Creating a hierarchical clustering model
# Importing packages
from sklearn.cluster import AgglomerativeClustering
# Creating a Agglomerative Clustering
hc_model = AgglomerativeClustering(n_clusters=2, linkage='ward') # ward linkage uses euclidean distance by default; the 'affinity' argument was removed in newer scikit-learn
# Fitting the model
y_cluster = hc_model.fit_predict(data_final)
Adding the cluster column
data_clustered = data_final.copy()
data_clustered["Cluster"] = y_cluster.astype('object')
data_clustered.head()
Visualizing the clusters
cols = list(data_final.columns)
#cols.remove("user_id")
cols
sns.pairplot(data_clustered, hue="Cluster", diag_kind="hist")
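A natural follow-up (not part of the original analysis) is to profile the clusters by their mean rating per category, to see what distinguishes them. A sketch with made-up columns and values:

```python
import pandas as pd

# Hypothetical clustered data: two categories, four users, two clusters
demo = pd.DataFrame({'cafes': [4.0, 4.2, 1.0, 0.8],
                     'gyms':  [0.5, 0.4, 3.9, 4.1],
                     'Cluster': [0, 0, 1, 1]})

# Mean rating per category within each cluster
profile = demo.groupby('Cluster').mean()
print(profile)
```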